0X01 常用的 python 库

1.urllib

import urllib
import urllib.request
urllib.request.urlopen("http://www.baidu.com")

2.re

3.requests

4.selenimu

这个库是配合一些驱动去爬取动态渲染网页的库

(1)chromedriver

我们使用的时候需要先下载一个 chromedriver.exe ,下载好了以后放在 chrome.exe 的相同目录下(默认安装路径),然后将这个目录放作为 PATH

import selenium
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://www.baidu.com")
driver.page_source

这种方式的唯一的缺点是会出现浏览器界面,这可能是我们不需要的,所以我们可以使用 headless 的方式来隐藏 web 界面(其实就是使用 options() 对象的 add_argument 属性去设置 headless 参数 )

import os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
import time

chrome_options = Options()
chrome_options.add_argument("--headless")

base_url = "http://www.baidu.com/"
#对应的chromedriver的放置目录
driver = webdriver.Chrome(executable_path=(r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe'), chrome_options=chrome_options)

driver.get(base_url + "/")

start_time=time.time()
print('this is start_time ',start_time)

driver.find_element_by_id("kw").send_keys("selenium webdriver")
driver.find_element_by_id("su").click()
driver.save_screenshot('screen.png')

driver.close()

end_time=time.time()
print('this is end_time ',end_time)

(2)phantomJS

这是另一种无界面的实现方法,虽然说不维护了,并且在使用的过程中会出现各种玄学,但是还是要介绍一下

和 Chromedriver 一样,我们首先要去下载 phantomJS,然后将其放在 PATH 中方便我们后面的调用

import selenium
from selenium import webdriver

driver = webdriver.phantomJS()
driver.get("http://www.baidu.com")
driver.page_source

5.lxml

这个是为 XPATH 的使用准备的库

6.beautifulsoup

pip 安装的时候注意一下要安装 beautifulsoup4,表示第四个版本,并且这个库是依赖于 lxml 的,所以安装之前请先安装 lxml

from bs4 import BeautifulSoup
soup = BeautifulSoup('`<html></html>','lxml')

7.pyquery

和 BeautifulSoup 一样也是一个网页解析库,但是相对来讲语法简单一些(语法是模仿 jQuery 的)

from pyquery import PyQuery as pq

page = pq('`<html>hello world</html>`')
result = page('html').text()
result

8.pymysql

这个库是 py 操纵 Mysql 的库

import pymysql

conn = pymysql.connect(host='localhost',user='root',password='root',port=3306,db='test')
cursor = conn.cursor()
result = cursor.execute('select * from user where id = 1')
print(cursor.fetchone())

9.pymango

import pymango

client = pymango.MongoClient('localhost')
db = client('newtestdb')
db['table'].insert({'name':'Bob'})
db['table'].find_one({'name':'Bob'})

10.redis

import redis

r = redis.Redis('localhost',6379)
r.set("name","Bob")
r.get('name')

11.flask

flask 在后期使用代理的时候可能会用到

from flask import Flask

app = Flask(__name__)

@app.route('/')

def hello():
    return "hello world"

if __name__ == '__main__':

    app.run(debug=True)

12.django

在分布式爬虫的维护方面可能会用到 django

13.jupyter

网页端记事本

0X02 基础部分

1.爬虫基本原理

(1)爬虫是什么

爬虫就是请求网页并且提取数据的自动化工具

(2)爬虫的基本流程

1.发起请求:

通过 HTTP 库向目标网站发起请求,即发送一个 request(可以包含额外的header信息),然后等待服务器的响应

2.获取响应内容

如果服务器正常响应,会得到一个 Response.其内容就是所要获取的页面的内容,类型可以是 HTML、JSON、二进制数据(图片视频)等

3.解析内容

对 HTML 数据可以使用正则表达式、网页解析库进行解析。如果是 Json 则可以转化成 JSON 对象解析,如果是二进制数据可以保存或者进一步处理

4.保存数据

保存的形式多样,可以是纯文本,也可以保存成数据库,或者保存为特定格式的文件

(3)请求的基本元素

1.请求方法
2.请求 URL
3.请求头
4.请求体(POST 方法独有)

(4)请响应的基本元素

1.状态码
2.响应头
3.响应体

(5)实例代码:

1.请求网页数据
import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("http://www.baidu.com",headers=headers)

print(res.status_code)
print(res.headers)
print(res.text)

当然这里使用的是 res.text 这种文本格式,如果返回的是一个二进制格式的数据(比如图片),那么我们应该使用 res.content

2.请求二进制数据
import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
res = requests.get("https://ss2.bdstatic.com/lfoZeXSm1A5BphGlnYG/icon/95486.png",headers=headers)
print(res.content)

with open(r'E:\桌面\1.png','wb') as f:
    f.write(res.content)
    f.close()

(6)解析方式

1.直接处理
2.转化成 json对象
3.正则匹配
4.BeautifulSoap
5.PyQuery
6.XPath

(7)response 的结果为什么和浏览器中的看到的不同

我们使用脚本去请求(只是一次请求)网页得到的是最原始的网页的源码,这个源码里面会有很多的远程的 js 和 css 的加载,我们的脚本是没法解析的,但是浏览器能对这些远程的链接进行再次的请求,然后利用加载到的数据对页面进行进一步的加载和渲染,于是我们在浏览器中看到的页面是很多请求渲染得到的结果,因此和我们一次请求的到的页面肯定是不一样的。

(8)如何解决 JS 渲染的问题

解决问题的方法本质上就是模拟浏览器的加载渲染,然后将渲染好的页面进行返回

1.分析Ajax 请求
2.selenium+webdriver(推荐)
3.splash
4.PyV8、Ghostpy

(9)如何存储数据

1.纯本文
2.关系型数据库
3.非关系型数据库
4.二进制文件

0X03 Urllib 库

1.什么是 Urllib 库

这个库是 python 的内置的一个请求库

urllib.request —————–>请求模块
urllib.error——————–>异常处理模块
urllib.parse——————–>url解析模块
urllib.robotparser ————>robots.txt 解析模块

2.urllib 库的基本使用

(1)函数调用原型

urllib.request.urlopen(url,data,timeout...)

(2)实例代码一:GET 请求

import urllib.request

res = urllib.request.urlopen("http://www.baidu.com")
print(res.read().decode('utf-8'))

(3)实例代码二:POST 请求

import urllib.request
import urllib.parse
from pprint import pprint

data = bytes(urllib.parse.urlencode({'world':'hello'}),encoding = 'utf8')
res = urllib.request.urlopen('https://httpbin.org/post',data = data)
pprint(res.read().decode('utf-8'))

(4)实例代码三:超时设置

import urllib.request

res = urllib.request.urlopen("http://httpbin.org.get",timeout = 1)
print(res.read().decode('utf-8'))

(5)实例代码:获取响应状态码、响应头、响应体

import urllib.request

res = urllib.request.urlopen("http://httpbin.org/get")
print(res.status)
print(res.getheaders())
print(res.getheader('Server'))
#获取响应体的使用 read() 的结果是 Bytes 类型,我们还要用 decode('utf-8')转化成字符串
print(res.read().decode('utf-8'))

(6) request 对象

from urllib import request,parse
from pprint import pprint

url = "https://httpbin.org/post"
headers = {
    'User-Agent':'hello wolrd',
    'Host':'httpbin.org'
}
dict = {
    'name':'Tom',
}

data = bytes(parse.urlencode(dict),encoding='utf8')
req = request.Request(url=url,data=data,headers=headers,method='POST')
res = request.urlopen(req)
pprint(res.read().decode('utf-8'))

3.urllib 库的进阶使用

(1)代理

import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http':'http://127.0.0.1:9743'
})

opener = urllib.request.build_opener(proxy_handler)
res = opener.open('https://www.taobao.com')
print(res.read())
1.获取 cookies
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open("http://www.baidu.com")
for item in cookie:
    print(item.name+"="+item.value)

格式一:

import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

格式二:

import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

(3)异常处理

1.实例代码一:URLError
from urllib import request
from urllib import error

try:
    urllib.request.urlopen("http://httpbin.org/xss")
except error.URLError as e:
    print(e.reason)
2.实例代码二:HTTPError
from urllib import request, error

try:
    response = request.urlopen('http://httpbin.org/xss')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
3.实例代码三:异常类型判断
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')

(4)URL 解析工具类

1.urlparse
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
2.urlunparse
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
3.urljoin
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
4.urlencode
from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

0X04 Requests 库

1.什么是 requests 库

这个库是基于 URLlib3 的,改善了 urllib api 比较繁琐的特点,使用几句简单的语句就能实现设置 cookie 和设置代理的功能,非常的方便

2.requests 库的基本使用

(1)获取响应信息

import requests

res = requests.get("http://www.baidu.com")
print(res.status_code)
print(res.text)
print(res.cookies)

(2)各种请求方法

import requests

requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.head("http://httpbin.org/get")
requests.delete("http://httpbin.org/delete")
requests.options("http://httpbin.org/get")

(3)带参数的 get 请求

import requests

params = {
    'id':1,
    'user':'Tom',
    'pass':'123456'
}

res = requests.get('http://httpbin.org/get',params = params )
print(res.text)

(4)解析 json

import requests

res = requests.get("http://httpbin.org/get")
print(res.json())

(5)获取二进制数据

import requests

response = requests.get("https://github.com/favicon.ico")
with open('favicon.ico', 'wb') as f:
    f.write(response.content)
    f.close()

(6)添加 headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
}
response = requests.get("https://www.zhihu.com/explore", headers=headers)
print(response.text)

(7)POST 请求

import requests

data = {
    'id':1,
    'user':'Tom',
    'pass':'123456',
}

res = requests.post('http://httpbin.org/post',data=data)
print(res.text)

(8) response 属性

import requests

data = {
    'id':1,
    'user':'Tom',
    'pass':'123456',
}

res = requests.post('http://httpbin.org/post',data=data)
print(res.text)
print(res.status_code)
print(res.headers)
print(res.cookies)
print(res.history)
print(res.url)

(9)响应状态码

每一个状态码都对应着一个名字,我们只要调用这个名字就可以进行判断了

100: ('continue',),
101: ('switching_protocols',),
102: ('processing',),
103: ('checkpoint',),
122: ('uri_too_long', 'request_uri_too_long'),
200: ('ok', 'okay', 'all_ok', 'all_okay', 'all_good', '\\o/', '✓'),
201: ('created',),
202: ('accepted',),
203: ('non_authoritative_info', 'non_authoritative_information'),
204: ('no_content',),
205: ('reset_content', 'reset'),
206: ('partial_content', 'partial'),
207: ('multi_status', 'multiple_status', 'multi_stati', 'multiple_stati'),
208: ('already_reported',),
226: ('im_used',),

# Redirection.
300: ('multiple_choices',),
301: ('moved_permanently', 'moved', '\\o-'),
302: ('found',),
303: ('see_other', 'other'),
304: ('not_modified',),
305: ('use_proxy',),
306: ('switch_proxy',),
307: ('temporary_redirect', 'temporary_moved', 'temporary'),
308: ('permanent_redirect',
      'resume_incomplete', 'resume',), # These 2 to be removed in 3.0

# Client Error.
400: ('bad_request', 'bad'),
401: ('unauthorized',),
402: ('payment_required', 'payment'),
403: ('forbidden',),
404: ('not_found', '-o-'),
405: ('method_not_allowed', 'not_allowed'),
406: ('not_acceptable',),
407: ('proxy_authentication_required', 'proxy_auth', 'proxy_authentication'),
408: ('request_timeout', 'timeout'),
409: ('conflict',),
410: ('gone',),
411: ('length_required',),
412: ('precondition_failed', 'precondition'),
413: ('request_entity_too_large',),
414: ('request_uri_too_large',),
415: ('unsupported_media_type', 'unsupported_media', 'media_type'),
416: ('requested_range_not_satisfiable', 'requested_range', 'range_not_satisfiable'),
417: ('expectation_failed',),
418: ('im_a_teapot', 'teapot', 'i_am_a_teapot'),
421: ('misdirected_request',),
422: ('unprocessable_entity', 'unprocessable'),
423: ('locked',),
424: ('failed_dependency', 'dependency'),
425: ('unordered_collection', 'unordered'),
426: ('upgrade_required', 'upgrade'),
428: ('precondition_required', 'precondition'),
429: ('too_many_requests', 'too_many'),
431: ('header_fields_too_large', 'fields_too_large'),
444: ('no_response', 'none'),
449: ('retry_with', 'retry'),
450: ('blocked_by_windows_parental_controls', 'parental_controls'),
451: ('unavailable_for_legal_reasons', 'legal_reasons'),
499: ('client_closed_request',),

# Server Error.
500: ('internal_server_error', 'server_error', '/o\\', '✗'),
501: ('not_implemented',),
502: ('bad_gateway',),
503: ('service_unavailable', 'unavailable'),
504: ('gateway_timeout',),
505: ('http_version_not_supported', 'http_version'),
506: ('variant_also_negotiates',),
507: ('insufficient_storage',),
509: ('bandwidth_limit_exceeded', 'bandwidth'),
510: ('not_extended',),
511: ('network_authentication_required', 'network_auth', 'network_authentication'),

实例代码:

import requests

response = requests.get('http://www.jianshu.com/hello.html')
exit() if not response.status_code == requests.codes.not_found else print('404 Not Found')

3.requests 库的进阶使用

(1)文件上传

import requests

files = {'file':open('E:\\1.png','rb')}

res= requests.post('http://httpbin.org/post',files=files)
print(res.text)

(2)获取 cookies

import requests

res = requests.get("http://www.baidu.com")
for key,value in res.cookies.items():
    print(key + "=" + value)

(3)会话维持

这个用法非常的重要,在我们的模拟登陆的过程中是必然会用到的方法,在 CTF 的写脚本的过程中也经常会用到,所以我们稍微详细解释一下

我们在使用 requests.get 的时候要明确一点就是,我们每使用一个 requests.get 就相当于重新打开了一个浏览器,因此上一个 requests.get 中设置的 cookie 在下面的第二次请求中是不能同步的,我们来看下面的例子

实例代码:

import requests

#这里我们设置了 cookie 
requests.get('http://httpbin.org/cookies/set/number/123456789')
#我们再次发起请求,查看是否能带着我们设置的 cookie 
res = requests.get('http://httpbin.org/cookies')
print(res.text)

运行结果:

{
  "cookies": {}
}

我们发现,正如我们上面分析的,我们第一次访问设置的 cookie 并没有在第二次访问中生效,那么怎么办呢,我们有一个 session() 方法能帮助我们解决这个问题

实例代码:

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
res = s.get('http://httpbin.org/cookies')
print(res.text)

运行结果:

{
  "cookies": {
    "number": "123456789"
  }
}

(4)证书验证

我们在访问 https 的网站的时候浏览器首先会对网站的证书进行校验,如果发现这个证书不是官方授权的话就会出现警告页面而不会继续访问该网站,对于爬虫来讲就会抛出异常,那如果我们想要让爬虫忽略证书的问题继续访问这个网站的话就要对其进行设置

1.忽略证书验证
import requests

response = requests.get('https://www.heimidy.cc/',verify=False)
print(response.status_code)

但是此时是存在一个警告的,我们可以通过导入 urilib3 的包,并调用消除 warning 的方法来消除这个警告

import requests
from requests.packages import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.heimidy.cc/',verify=False)
print(response.status_code)
2.手动指定本地证书进行验证
import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

(5)代理设置

除了常见到的 https 和 http 代理以=以外,我们还可以使用 socks 代理,不过需要 pip 安装一个 requests[socks] 包

import requests

proxies = {
    "http":"http://127.0.0.1:1080",
    "https":"https://127.0.0.1:1080"
}

res = requests.get("https://www.google.com",proxies=proxies)
print(res.status_code)

这里有一个疑问就是我是用 socks 代理访问 google 是失败的,会报错

实例代码:

import requests

proxies = {
    "http":"socks5://127.0.0.1:1080",
    "https":"socks5://127.0.0.1:1080"
}

res = requests.get("https://www.google.com",proxies=proxies,verify=False)
print(res.status_code)

运行结果:

SSLError: SOCKSHTTPSConnectionPool(host='www.google.com', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError("bad handshake: SysCallError(-1, 'Unexpected EOF')")))

试了一些方法都没什么效果,有待以后考证

(6)超时设置

import requests
from requests.exceptions import ReadTimeout
try:
    response = requests.get("http://httpbin.org/get", timeout = 0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')

(7)Basic 认证

实例代码一:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://120.27.34.24:9001', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

实例代码二:

import requests

r = requests.get('http://120.27.34.24:9001', auth=('user', '123'))
print(r.status_code)

(8)异常处理

import requests
from requests.exceptions import ReadTimeout, ConnectionError, RequestException
try:
    response = requests.get("http://httpbin.org/get", timeout = 0.5)
    print(response.status_code)
except ReadTimeout:
    print('Timeout')
except ConnectionError:
    print('Connection error')
except RequestException:
    print('Error')

0X05 正则表达式

1.什么是正则表达式

正则表达式是对字符串进行操作的一种逻辑公式,用事先定义好的一些特定的字符,以及这些字符的组合,组成一个规则字符串,用这个规则字符串去表达对字符串的一种过滤的逻辑,在python 中使用过re 库来实现

2.常见的匹配模式

此处输入图片的描述

3.re.match

re.match(pattern, string, flags=0)

(1)常规匹配

span() 方法是返回匹配的范围,group() 是返回匹配的结果

实例代码:

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match('^\w{5}\s\d{3}\s\d{4}\s\w{10}.*Demo$',content)
print(res.span())
print(res.group())

(2)泛匹配

import re

content = 'Hello 123 4567 World_This is a Regex Demo'
res = re.match('^Hello.*Demo$',content)
print(res.span())
print(res.group())

(3)匹配具体内容

我们如果想匹配具体的内容,我们可以用小括号将其括起来

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match('^Hello\s(\d+)\s.*Demo$',content)
print(res.span(1))
print(res.group(1))

(4)贪婪与非贪婪模式

所谓贪婪模式指的就是.* 会匹配尽可能多的字符,我们来看下面的例子

实例代码:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match('^He.*(\d+).*Demo$',content)
print(res.span(1))
print(res.group(1))

运行结果:

(12, 13)
7

我们的本意是想匹配 1234567 这个字符串,但是实际上我们只匹配到了 7 ,因为.*默认的贪婪模式将123456匹配掉了,那么为了解决这个问题,我们可以使用 去消除非贪婪模式

实例代码:

import re

content = 'Hello 1234567 World_This is a Regex Demo'
res = re.match('^He.*?(\d+).*Demo$',content)
print(res.span(1))
print(res.group(1))

运行结果:

(6, 13)
1234567

(5)匹配模式

匹配模式是用来解决一些细节问题的,比如匹配中的是否区分大小写、是否能匹配换行符等

实例代码:

import re

content = '''Hello 1234567 World_This is 

a Regex Demo'''

res = re.match('^He.*?(\d+).*Demo$',content,re.S)
print(res.span(1))
print(res.group(1))

运行结果:

(6, 13)
1234567

可以发现我们的 .* 本来是不能匹配换行符的,但是我们使用了 re.S 的模式以后就可以正常匹配了

(6)转义字符

如果在待匹配字符串中出现了正则表达式中的特殊字符,我们要对其进行转义操作

import re

content = 'price is $5.00'
res = re.match('price is \$5\.00', content)
print(res.group())

我们上面介绍的 re.match 有一个弊端就是它只能从开头开始匹配,也就是说如果我们的正则不匹配第一个字符那么是无法匹配中间的字符的,所以我们还有一个武器叫 re.search,它会扫描整个字符串并返回第一个成功的匹配。

实例代码:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.search('Hello.*?(\d+).*?Demo', content)
print(res.group(1))

运行结果:

1234567 

因为这个特性能大大减少我们写正则的难度,因此,我们在能用 search 的情况下就不要用 match

匹配练习:

实例代码:

import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君"><i class="fa fa-user"></i>但愿人长久</a>
        </li>
    </ul>
</div>'''

res = re.search('<li.*?/2\.mp3.*?singer="(.*?)">(.*?)</a>',html,re.S)
print(res.group(1),res.group(2))

运行结果:

任贤齐 沧海一声笑

5.re.findall

与之前两个不同的是 re.findall 搜会索字符串,并以列表形式返回全部能匹配的子串。

匹配练习一:

实例代码:

import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''

res = re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>',html,re.S)
#print(res)
for i in res:
    print(i[0],i[1],i[2])

运行结果:

[('/2.mp3', '任贤齐', '沧海一声笑'), ('/3.mp3', '齐秦', '往事随风'), ('/4.mp3', 'beyond', '光辉岁月'), ('/5.mp3', '陈慧琳', '记事本'), ('/6.mp3', '邓丽君', '但愿人长久')]
/2.mp3 任贤齐 沧海一声笑
/3.mp3 齐秦 往事随风
/4.mp3 beyond 光辉岁月
/5.mp3 陈慧琳 记事本
/6.mp3 邓丽君 但愿人长久

匹配练习二:

实例代码:

import re

html = '''<div id="songs-list">
    <h2 class="title">经典老歌</h2>
    <p class="introduction">
        经典老歌列表
    </p>
    <ul id="list" class="list-group">
        <li data-view="2">一路上有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>'''

res = re.findall('<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>',html,re.S)
for i in res:
    print(i[0],i[1],i[2])

运行结果:

 一路上有你 
<a href="/2.mp3" singer="任贤齐"> 沧海一声笑 </a>
<a href="/3.mp3" singer="齐秦"> 往事随风 </a>
<a href="/4.mp3" singer="beyond"> 光辉岁月 </a>
<a href="/5.mp3" singer="陈慧琳"> 记事本 </a>
<a href="/6.mp3" singer="邓丽君"> 但愿人长久 </a>

6.re.sub

该方法的作用是替换字符串中每一个匹配的子串后返回替换后的字符串

实例代码一:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
res = re.sub('\d+','K0rz3n',content)
print(res)

运行结果:

Extra stings Hello K0rz3n World_This is a Regex Demo Extra stings

有时候我们替换的时候需要保留原始字符串,这个时候我们就要使用正则表达式的后向引用技术

实例代码二:

import re

content = 'Extra stings Hello 1234567 World_This is a Regex Demo Extra stings'
content = re.sub('(\d+)', '\\1 8910', content)
print(content)

运行结果:

Extra stings Hello 1234567 890 World_This is a Regex Demo Extra stings

7.re.compile

该方法可以将正则表达式转换成正则表达式对象,方便我们后期的复用

实例代码:

import re

content = '''Hello 1234567 World_This
is a Regex Demo'''
pattern = re.compile('Hello.*Demo', re.S)
res = re.match(pattern, content)
print(res.group(0))

8.实战练习爬取豆瓣读书

import requests
import re
content = requests.get('http://book.douban.com/').text
pattern = re.compile('<li.*?cover.*?href="(.*?)".*?title="(.*?)".*?more-meta.*?author">(.*?)</span>.*?year">(.*?)</span>.*?</li>', re.S)
results = re.findall(pattern, content)
for result in results:
    url, name, author, date = result
    author = re.sub('\s', '', author)
    date = re.sub('\s', '', date)
    print(url, name, author, date)

0X06 BeautifulSoup

1.什么是 BeautifulSoup

这是一个方便的网页解析库,不用编写正则就是实现网页信息的提取

2.常见的配合使用的解析库

此处输入图片的描述

3.基本使用

实例代码:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
print(soup.title.string)

运行结果:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The Dormouse's story

4.标签选择器

(1)选择元素

使用soup.(点)属性标签的方式来进行选择,如果有多个符合的话只能返回第一个匹配的标签

实例代码:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.head)
print(soup.title)
print(soup.p)

运行结果:

<head><title>The Dormouse's story</title></head>
<title>The Dormouse's story</title>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>

(2)获取属性

实例代码:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

运行结果:

dromouse
dromouse

(3)获取内容

实例代码:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p clss="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.string)

运行结果:

The Dormouse's story

(4)嵌套选择

实例代码:

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.head.title.string)

运行结果:

The Dormouse's story

(5)获取子孙节点

1.contents

这种方法是以列表形式返回标签的子节点

实例代码:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)

运行结果:

['\n            Once upon a time there were three little sisters; and their names were\n            ', <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>, '\n', <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, ' \n            and\n            ', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '\n            and they lived at the bottom of a well.\n        ']
2.children

这种方法返回的是一个子节点的迭代器形式

实例代码:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

运行结果:

<list_iterator object at 0x1064f7dd8>
0 
            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
4  
            and

5 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
6 
            and they lived at the bottom of a well.
3.descendants

返回子孙节点,其实和上面 children 的不同在于这个方法会再次强调一下孙子节点

实例代码:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)

运行结果:

<generator object descendants at 0x10650e678>
0 
            Once upon a time there were three little sisters; and their names were

1 <a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
2 

3 <span>Elsie</span>
4 Elsie
5 

6 

7 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
8 Lacie
9  
            and

10 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
11 Tillie
12 
            and they lived at the bottom of a well.

(6)父节点和祖先节点

1.parent

实例代码:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.a.parent)

运行结果:

<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<span>Elsie</span>
</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> 
and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
 and they lived at the bottom of a well.
</p>
2.parents

以列表的形式输出所有的祖先节点

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.parents)))

(7)兄弟节点

实例代码:

html = """
<html>
    <head>
        <title>The Dormouse's story</title>
    </head>
    <body>
        <p class="story">
            Once upon a time there were three little sisters; and their names were
            <a href="http://example.com/elsie" class="sister" id="link1">
                <span>Elsie</span>
            </a>
            <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> 
            and
            <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
            and they lived at the bottom of a well.
        </p>
        <p class="story">...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(list(enumerate(soup.a.next_siblings)))
print(list(enumerate(soup.a.previous_siblings)))

运行结果:

[(0, '\n'), (1, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (2, ' \n            and\n            '), (3, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (4, '\n            and they lived at the bottom of a well.\n        ')]
[(0, '\n            Once upon a time there were three little sisters; and their names were\n            ')]

5.标准选择器

上面我们讲述的标签选择器虽然选择速度快,但是选择的内容也是比较笼统的,在现实中很难满足我们的需求,于是我们就需要更强大的选择器帮助我们去实现

find_all( name , attrs , recursive , text , **kwargs )

可根据标签名、属性、内容查找文档

(1) name

实例代码一:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
soup.find_all('ul')

运行结果:

[<ul class="list" id="list-1">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 <li class="element">Jay</li>
 </ul>, <ul class="list list-small" id="list-2">
 <li class="element">Foo</li>
 <li class="element">Bar</li>
 </ul>]

如果我们还想获取更里面的标签,我们可以再次对获取到的 ul 标签使用 find_all()

实例代码二:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
for i in soup.find_all('ul'):
    print(i.find_all('li'))

运行结果:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

(2)attrs

传入想要定位的属性键值对,就能成功定位

实例代码一:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))

运行结果:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]

或者,如果 你觉得这种方式比较复杂的话我们还可以使用更加简单的直接使用等于号链接属性和值作为参数传入来定位

实例代码二:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1" name="elements">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id = 'list-1'))
print(soup.find_all(class_ = 'panel-heading'))

运行结果:

[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<div class="panel-heading">
<h4>Hello</h4>
</div>]

注意:

Class 是 python 中的关键字,因此我们在写属性的时候不能直接写 classs,否则会引起歧义,所以我们改成了 class_

(3)text

实例代码:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text='Foo'))

运行结果:

['Foo', 'Foo']

(4)其他

find( name , attrs , recursive , text , **kwargs )
find_all() 返回所有元素,而find()返回单一元素

find_parents() find_parent()
find_parents()返回所有祖先节点,find_parent()返回直接父节点。

find_next_siblings() find_next_sibling()
find_next_siblings()返回后面所有兄弟节点,find_next_sibling()返回后面第一个兄弟节点。

find_previous_siblings() find_previous_sibling()
find_previous_siblings()返回前面所有兄弟节点,find_previous_sibling()返回前面第一个兄弟节点。

find_all_next() find_next()
find_all_next()返回节点后所有符合条件的节点, find_next()返回第一个符合条件的节点

find_all_previous() 和 find_previous()
find_all_previous()返回节点后所有符合条件的节点, find_previous()返回第一个符合条件的节点

6.CSS选择器

(1)基本使用

通过select()直接传入CSS选择器即可完成选择

实例代码一:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel-heading'))
print(soup.select('#list-1'))
print(soup.select('li'))

运行结果:

[<div class="panel-heading">
<h4>Hello</h4>
</div>]
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]

实例代码二:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))

运行结果:

[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]

(2)获取属性

实例代码:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

运行结果:

list-1
list-1
list-2
list-2

(3)获取内容

实例代码:

html='''
<div class="panel">
    <div class="panel-heading">
        <h4>Hello</h4>
    </div>
    <div class="panel-body">
        <ul class="list" id="list-1">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
            <li class="element">Jay</li>
        </ul>
        <ul class="list list-small" id="list-2">
            <li class="element">Foo</li>
            <li class="element">Bar</li>
        </ul>
    </div>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

运行结果:

Foo
Bar
Jay
Foo
Bar

0X07 PyQuery

PyQuery 是另一个比较强大的网页解析库,语法完全从 jQuery 迁移过来,所以对于熟悉 JQuery 的开发人员来说是非常好的选择

1.初始化

(1)字符串初始化

实例代码:

html = '''
<div>
    <ul>
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq

doc = pq(html)
print(doc('li'))

运行结果:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

(2)URL初始化

实例代码:

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

运行结果:

<head><meta http-equiv="content-type" content="text/html;charset=utf-8"/><meta http-equiv="X-UA-Compatible" content="IE=Edge"/><meta content="always" name="referrer"/><link rel="stylesheet" type="text/css" href="http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css"/><title>ç™¾åº¦ä¸€ä¸‹ï¼Œä½ å°±çŸ¥é“</title></head>

(3)文件初始化

实例代码:

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

2.基本CSS选择器

实例代码:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('#container .list li'))

运行结果:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

3.查找元素

(1)子元素

实例代码一:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
li = doc('.list').find('li')
print(li)

运行结果:

<li class="item-0">first item</li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
<li class="item-1 active"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>

当然,除了使用 find 方法以外,我们还能使用 children 方法

实例代码二:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
lis = items.children('.active')
print(lis)

运行结果:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>

(2)父元素

实例代码一:

html = '''
<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
items = doc('.list')
container = items.parent()
print(container)

运行结果:

<div id="container">
    <ul class="list">
         <li class="item-0">first item</li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
         <li class="item-1 active"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a></li>
     </ul>
 </div>

使用 parent() 是返回直接父节点,但是使用 parents()能返回全部的父节点

实例代码二:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')
parents = items.parents('.wrap')
print(parents)

运行结果:

<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>

(3)兄弟节点

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.list .item-0.active')
print(li.siblings('.active'))

运行结果:

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

4.遍历

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
lis = doc('li').items()
for i in lis:
    print(i)

运行结果:

<li class="item-0">first item</li>

<li class="item-1"><a href="link2.html">second item</a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-1 active"><a href="link4.html">fourth item</a></li>

<li class="item-0"><a href="link5.html">fifth item</a></li>

5.获取信息

(1)获取属性

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
a = doc('.list .item-0.active a')
print(a.attr.href)
print(a.attr('href'))

运行结果:

link3.html
link3.html

(2)获取文本

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a.text())

运行结果:

third item

(3)获取 HTML

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li.html())

运行结果:

<a href="link3.html"><span class="bold">third item</span></a>

6.DOM操作

(1)addClass、removeClass

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.removeClass('active')
print(li)
li.addClass('active')
print(li)

运行结果:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

(2)attr、css

实例代码:

html = '''
<div class="wrap">
    <div id="container">
        <ul class="list">
             <li class="item-0">first item</li>
             <li class="item-1"><a href="link2.html">second item</a></li>
             <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
             <li class="item-1 active"><a href="link4.html">fourth item</a></li>
             <li class="item-0"><a href="link5.html">fifth item</a></li>
         </ul>
     </div>
 </div>
'''
from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr('name','link')
print(li)
li.css('front-size','14px')
print(li)

运行结果:

<li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link"><a href="link3.html"><span class="bold">third item</span></a></li>

<li class="item-0 active" name="link" style="front-size: 14px"><a href="link3.html"><span class="bold">third item</span></a></li>

(3)remove

实例代码:

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
 </div>
'''

from pyquery import PyQuery as pq

doc = pq(html)
wrap = doc('.wrap')
print(wrap.text())
wrap.find('p').remove()
print(wrap.text())

运行结果:

Hello, World
This is a paragraph.
Hello, World

(4)其他

https://pyquery.readthedocs.io/en/latest/api.html

0X08 Selenium

该函数库可以配合各种浏览器引擎以及 phantomJS 进行自动化测试工作,主要是为了解决 JS 动态渲染页面无法直接抓取的问题

1.基本使用

实例代码:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
    input = browser.find_element_by_id('kw')
    input.send_keys('Python')
    input.send_keys(Keys.ENTER)
    wait = WebDriverWait(browser, 10)
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    print(browser.current_url)
    print(browser.get_cookies())
    print(browser.page_source)
finally:
    browser.close()

2.声明对象

from selenium import webdriver

browser = webdriver.Chrome()
browser = webdriver.Firefox()
browser = webdriver.Edge()
browser = webdriver.PhantomJS()
browser = webdriver.Safari()

3.访问页面

from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.baidu.com")
print(browser.page_source)
browser.close()

4.查找元素

(1)查找单个元素

实例代码一:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element_by_id('q')
input_second = browser.find_element_by_css_selector('#q')
input_third = browser.find_element_by_xpath('//*[@id="q"]')
print(input_first, input_second, input_third)
browser.close()

运行结果:

<selenium.webdriver.remote.webelement.WebElement (session="06448c4d710820390f33d87c3033a505", element="0.38405353494037175-1")> <selenium.webdriver.remote.webelement.WebElement (session="06448c4d710820390f33d87c3033a505", element="0.38405353494037175-1")> <selenium.webdriver.remote.webelement.WebElement (session="06448c4d710820390f33d87c3033a505", element="0.38405353494037175-1")>

补充:

除此之外还有一些查找元素的方法,如下

find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

实例代码二:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
input_first = browser.find_element(By.ID, 'q')
print(input_first)
browser.close()

运行结果:

<selenium.webdriver.remote.webelement.WebElement (session="1f209c0d11551c40d9d20ad964fef244", element="0.07914603542731591-1")>

(2)查找多个元素

实例代码一:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements_by_css_selector('.service-bd li')
print(lis)
browser.close()

实例代码二:

from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
browser.get('https://www.taobao.com')
lis = browser.find_elements(By.CSS_SELECTOR, '.service-bd li')
print(lis)
browser.close()

补充:

除了上面的查找方法,查找多个元素还有下面的一些常见的方法:

find_elements_by_name find_elements_by_xpath
find_elements_by_link_text find_elements_by_partial_link_text
find_elements_by_tag_name find_elements_by_class_name
find_elements_by_css_selector

(3)元素的交互操作

我们可以对获取的元素调用交互方法

实例代码:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get("http://www.taobao.com")
input = browser.find_element_by_id('q')
input.send_keys('iphone')
time.sleep(1)
input.clear()
input.send_keys('ipad')
button = browser.find_element_by_class_name('btn-search')
button.click()

补充:

官方文档:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.remote.webelement

(4)交互动作

将动作附加到动作链中串行执行,这是我们使用 selenium 去模拟键鼠操作的非常常用的东西

实例代码:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
target = browser.find_element_by_css_selector('#droppable')
actions = ActionChains(browser)
actions.drag_and_drop(source, target)
actions.perform()

补充:

官方文档:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.common.action_chains

ActionChains方法列表
click(on_element=None) ——单击鼠标左键
click_and_hold(on_element=None) ——点击鼠标左键,不松开
context_click(on_element=None) ——点击鼠标右键
double_click(on_element=None) ——双击鼠标左键
drag_and_drop(source, target) ——拖拽到某个元素然后松开
drag_and_drop_by_offset(source, xoffset, yoffset) ——拖拽到某个坐标然后松开
key_down(value, element=None) ——按下某个键盘上的键
key_up(value, element=None) ——松开某个键
move_by_offset(xoffset, yoffset) ——鼠标从当前位置移动到某个坐标
move_to_element(to_element) ——鼠标移动到某个元素
move_to_element_with_offset(to_element, xoffset, yoffset)
——移动到距某个元素(左上角坐标)多少距离的位置
perform() ——执行链中的所有动作
release(on_element=None) ——在某个元素位置松开鼠标左键
send_keys(keys_to_send) ——发送某个键到当前焦点的元素
send_keys_to_element(element,
keys_to_send) ——发送某个键到指定元素

(5)执行JavaScript

当我们找不到现成的 api 的时候,我们可以使用 js 来帮助我们实现一些动作,比如进度条的拖拽等

实例代码:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('https://www.zhihu.com/explore')
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
browser.execute_script('alert("To Bottom")')
browser.close()

5.获取元素信息

(1)获取属性

实例代码:

from selenium import webdriver
from selenium.webdriver import ActionChains

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
logo = browser.find_element_by_id('zh-top-link-logo')
print(logo)
print(logo.get_attribute('class'))
browser.close()

(2)获取文本值

实例代码:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.text)
browser.close()

(3)获取ID、位置、标签名、大小

实例代码:

from selenium import webdriver

browser = webdriver.Chrome()
url = 'https://www.zhihu.com/explore'
browser.get(url)
input = browser.find_element_by_class_name('zu-top-add-question')
print(input.id)
print(input.location)
print(input.tag_name)
print(input.size)
browser.close()

6.Frame 操作

如果出现 frame 或者 iframe 我们必须要进入这个区域才能进行操作

实例代码:

import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
browser.get(url)
browser.switch_to.frame('iframeResult')
source = browser.find_element_by_css_selector('#draggable')
print(source)
try:
    logo = browser.find_element_by_class_name('logo')
except NoSuchElementException:
    print('NO LOGO')
finally:
    browser.close()
browser.switch_to.parent_frame()
logo = browser.find_element_by_class_name('logo')
print(logo)
print(logo.text)

7.等待

(1)隐式等待

这个方法是针对网页中的 ajax 请求设计的,当 webdriver 查找元素或元素并没有立即出现的时候(可能还需要后期的 ajax 请求),隐式等待将等待一段时间再查找 DOM,默认的时间是0

实例代码:

from selenium import webdriver

browser = webdriver.Chrome()
browser.implicitly_wait(10)
browser.get('https://www.zhihu.com/explore')
input = browser.find_element_by_class_name('zu-top-add-question')
print(input)
browser.close()

运行结果:

<selenium.webdriver.remote.webelement.WebElement (session="a607aed65e614d975de7a1273522ff3a", element="0.7555513559347986-1")>

(2)显式等待

显示等待会设置一个条件和一个最长等待时间,如果在这个最长等待时间内条件还是没有成立才会抛出异常

实例代码:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()
browser.get('https://www.taobao.com/')
wait = WebDriverWait(browser, 10)
input = wait.until(EC.presence_of_element_located((By.ID, 'q')))
button = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '.btn-search')))
print(input, button)
browser.close()

运行结果:

<selenium.webdriver.remote.webelement.WebElement (session="c3730876d29127a08cdcdb54a664600f", element="0.37070383186598255-1")> <selenium.webdriver.remote.webelement.WebElement (session="c3730876d29127a08cdcdb54a664600f", element="0.37070383186598255-2")>

补充:

常见的判断条件:

title_is 标题是某内容
title_contains 标题包含某内容
presence_of_element_located 元素加载出,传入定位元组,如(By.ID, ‘p’)
visibility_of_element_located 元素可见,传入定位元组
visibility_of 可见,传入元素对象
presence_of_all_elements_located 所有元素加载出
text_to_be_present_in_element 某个元素文本包含某文字
text_to_be_present_in_element_value 某个元素值包含某文字
frame_to_be_available_and_switch_to_it frame加载并切换
invisibility_of_element_located 元素不可见
element_to_be_clickable 元素可点击
staleness_of 判断一个元素是否仍在DOM,可判断页面是否已经刷新
element_to_be_selected 元素可选择,传元素对象
element_located_to_be_selected 元素可选择,传入定位元组
element_selection_state_to_be 传入元素对象以及状态,相等返回True,否则返回False
element_located_selection_state_to_be 传入定位元组以及状态,相等返回True,否则返回False
alert_is_present 是否出现Alert

官方文档:
http://selenium-python.readthedocs.io/api.html#module-selenium.webdriver.support.expected_conditions

8.前进后退

实例代码:

from selenium import webdriver 

browser = webdriver.Chrome()
browser.get("http://www.baidu.com")
browser.get("http://www.taobao.com")
browser.get("http://www.zhihu.com")
browser.back()
browser.forward()
browser.close()

9.Cookies

实例代码:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
print(browser.get_cookies())
browser.add_cookie({'name':'Tom','pass':'123456','value': 'germey'})
print(browser.get_cookies())
browser.delete_all_cookies()
print(browser.get_cookies())
browser.close()

10.选项卡操作

实例代码:

from selenium import webdriver
import time

browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
browser.execute_script('window.open()')
print(browser.window_handles)
browser.switch_to.window(browser.window_handles[1])
browser.get('http://www.taobao.com')
time.sleep(1)
browser.switch_to.window(browser.window_handles[0])
browser.get('http://httpbin.org')
browser.close()

运行结果:

['CDwindow-3FCC47842DFF6841B4C86EE72CB7DB93', 'CDwindow-CCFA4494DE4B6C99494BE87524153E4E']

11.异常处理

实例代码:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException

browser = webdriver.Chrome()
try:
    browser.get('https://www.baidu.com')
except TimeoutException:
    print('Time Out')
try:
    browser.find_element_by_id('hello')
except NoSuchElementException:
    print('No Element')
finally:
    browser.close()

运行结果:

No Element

补充:

官方文档:详细文档:http://selenium-python.readthedocs.io/api.html#module-selenium.common.exceptions